[TRTLLM-8952][feat] Support Multi-Node Disagg Perf Test in CI #9138
Conversation
📝 Walkthrough

This PR introduces multi-node disaggregated performance testing infrastructure for Slurm-based deployments. It adds a new Groovy orchestration method, four Slurm execution scripts (Bash/Python), refactors core test configuration handling to support aggregated and multi-node disaggregated server architectures, extends performance database utilities with GPU-type filtering, and introduces new test configurations and execution parameters for multi-node GB200 scenarios.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Jenkins as Jenkins<br/>(L0_Test.groovy)
    participant PyOrch as Python<br/>(slurm_exec.py)
    participant Slurm as Slurm sbatch
    participant Launch as Slurm Launch Script<br/>(slurm_launch.sh)
    participant Compute as Compute Nodes
    participant Run as Test Runner<br/>(slurm_run.sh)
    User->>Jenkins: Trigger multi-node disagg test
    Jenkins->>Jenkins: runLLMMultiNodeDisaggTestlistWithSbatch()
    Jenkins->>Jenkins: Prepare test files & configs
    Jenkins->>PyOrch: Execute slurm_exec.py
    PyOrch->>PyOrch: Parse YAML, compute topology<br/>(ctx_servers, gen_servers, total_nodes)
    PyOrch->>Slurm: Submit sbatch job
    Slurm->>Launch: Allocate nodes & spawn srun
    Launch->>Compute: install.sh on each node
    Compute->>Compute: Setup Python, install wheel
    Launch->>Compute: Start gen servers (srun)
    Launch->>Compute: Start ctx servers (srun)
    Launch->>Compute: Start disagg server + benchmark (srun)
    Compute->>Run: Execute pytest via slurm_run.sh
    Run->>Run: Set LD_LIBRARY_PATH<br/>Inject TensorRT wheel
    Run->>Compute: Run performance tests
    PyOrch->>Slurm: Poll job status via squeue
    Slurm-->>PyOrch: Job completion
    PyOrch->>PyOrch: Check exit code via sacct
    PyOrch-->>Jenkins: Return status
```
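The submit/poll/verify loop shown in the diagram can be sketched as follows. This is a minimal illustration of the pattern, not the actual slurm_exec.py; the helper names and the exact sacct/squeue flags here are assumptions based on standard Slurm CLI behavior:

```python
import subprocess
import time


def parse_sacct_exit_code(exit_code_field: str) -> int:
    """sacct reports ExitCode as 'rc:signal' (e.g. '0:0'); keep only the rc part."""
    return int(exit_code_field.strip().split(":")[0])


def job_still_queued(squeue_output: str, job_id: str) -> bool:
    """The job is finished once squeue no longer lists its job id."""
    return job_id in squeue_output


def submit_and_wait(sbatch_cmd, poll_seconds: int = 30) -> int:
    """Submit a batch job, poll squeue until it leaves the queue, then
    read the final exit code back from sacct (illustrative only)."""
    out = subprocess.check_output(sbatch_cmd).decode()
    job_id = out.strip().split()[-1]  # sbatch prints 'Submitted batch job <id>'
    while True:
        squeue = subprocess.check_output(["squeue", "-j", job_id]).decode()
        if not job_still_queued(squeue, job_id):
            break
        time.sleep(poll_seconds)
    sacct = subprocess.check_output(
        ["sacct", "-j", job_id, "-X", "-n", "-o", "ExitCode"]).decode()
    return parse_sacct_exit_code(sacct)
```

The pure helpers can be unit-tested without a Slurm cluster, which is the main reason to split the parsing out of the polling loop.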
```mermaid
sequenceDiagram
    participant Config as Config YAML
    participant TestPerf as test_perf.py
    participant Parser as Config Parser
    participant Commands as Command Gen
    Config->>Parser: parse_aggr_config_file() / parse_multi_node_disagg_config_file()
    alt Aggregated Server Path
        Parser->>TestPerf: ServerConfig(server_config_dict)
        Parser->>TestPerf: ClientConfig(client_config_dict)
        TestPerf->>Commands: get_trtllm_aggr_commands()
        Commands-->>TestPerf: [server_cmds, client_cmds]
    else Multi-Node Disaggregated Path
        Parser->>TestPerf: disagg_configs[] + ServerConfig
        TestPerf->>Commands: get_trtllm_multi_node_disagg_commands()
        Commands-->>TestPerf: PerfMultiNodeDisaggScriptTestCmds<br/>(ctx_cmds, gen_cmds, disagg_cmds, benchmark_cmds)
    end
    TestPerf->>TestPerf: set_aggr_server_configs() / set_multi_node_disagg_server_configs()
    TestPerf-->>TestPerf: Update PerfTestConfig with commands
```
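The topology step in the first diagram (ctx_servers, gen_servers, total_nodes) reduces to simple GPU accounting. The sketch below illustrates that math only; the function name and the (count, gpus_per_server) pair representation are assumptions, and the real computation in slurm_exec.py may differ in detail:

```python
import math


def compute_total_nodes(ctx_servers, gen_servers, gpus_per_node):
    """Illustrative topology math: each list holds (num_servers,
    gpus_per_server) pairs; the node count is the total GPU demand
    divided by GPUs per node, rounded up."""
    total_gpus = sum(n * g for n, g in ctx_servers)
    total_gpus += sum(n * g for n, g in gen_servers)
    return math.ceil(total_gpus / gpus_per_node)


# 1 ctx server on 4 GPUs + 1 gen server on 8 GPUs, 4 GPUs per node
nodes = compute_total_nodes([(1, 4)], [(1, 8)], gpus_per_node=4)
```

Rounding up matters: a 6-GPU demand on 4-GPU nodes still needs two full node allocations.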
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/perf/utils.py (1)
276-339: Do not pass `env` to `defs.trt_test_alternative.check_output`.

`check_output` from `defs.trt_test_alternative` only accepts a `command` argument (see Line 58 in tests/integration/defs/perf/misc.py). Passing `env=…` here throws a `TypeError`, so aggregated/disagg perf runs will fail before the benchmark starts. Please switch these call sites (Line 276 for the client path, Lines 338-339 for the disagg path, and the multi-node branch later in this file) to `subprocess.check_output(...).decode()` or another helper that supports `env`.

```diff
-                output += check_output(self.client_cmds[cmd_idx],
-                                       env=venv._new_env)
+                output_bytes = subprocess.check_output(
+                    self.client_cmds[cmd_idx], env=venv._new_env
+                )
+                output += output_bytes.decode()
```
🧹 Nitpick comments (9)
jenkins/scripts/perf/disaggregated/install.sh (2)
13-13: Consider running `apt-get update` before installing packages.

The `apt-get install -y libffi-dev` command might fail if the package lists are stale. Consider adding `apt-get update -y` before the install command to ensure the latest package information is available.

Apply this diff:

```diff
 python3 --version
-apt-get install -y libffi-dev
+apt-get update -y && apt-get install -y libffi-dev
 nvidia-smi && nvidia-smi -q && nvidia-smi topo -m
```
9-10: Add explicit error messages for download and extraction failures.

While `set -Eeuo pipefail` will cause the script to exit on failure, adding explicit error checking with informative messages would improve debuggability when tarball downloads or extractions fail.

Apply this diff:

```diff
-wget -nv $llmTarfile
-tar -zxf $tarName
+wget -nv $llmTarfile || { echo "Error: Failed to download $llmTarfile"; exit 1; }
+tar -zxf $tarName || { echo "Error: Failed to extract $tarName"; exit 1; }
 which python3
```

jenkins/scripts/perf/disaggregated/slurm_exec.py (1)
178-186: Consider more specific exception handling.

The broad `except Exception as e:` clause on line 178 catches all exceptions, which could potentially mask programming errors. Consider catching specific exceptions or at least logging more details about unexpected exceptions.

Apply this diff to improve error visibility:

```diff
-    except Exception as e:
+    except (OSError, subprocess.SubprocessError) as e:
         print(f"Error during job monitoring: {e}")
+        import traceback
+        traceback.print_exc()
         tail_proc.terminate()
```

tests/integration/defs/perf/test_perf.py (6)
570-576: Add an else clause for unexpected types in `eagle3_layers_to_capture`.

The code handles `int` and `list` types but silently defaults to an empty list for any other type. Consider adding logging or an assertion to catch unexpected types during development.

Apply this diff to add defensive handling:

```diff
 eagle3_value = speculative_config.get('eagle3_layers_to_capture', [])
 if isinstance(eagle3_value, int):
     self.eagle3_layers_to_capture = [eagle3_value]
 elif isinstance(eagle3_value, list):
     self.eagle3_layers_to_capture = eagle3_value
 else:
+    print_warning(f"Unexpected type for eagle3_layers_to_capture: {type(eagle3_value)}, defaulting to empty list")
     self.eagle3_layers_to_capture = []
```
721-747: Consider removing the unused parameter if it is truly unnecessary.

Static analysis flags `working_dir` as unused. If it's not needed for future extensions, consider removing it to reduce confusion.
808-808: Add an explicit Optional type hint.

Static analysis recommends using explicit `Optional[str]` or `str | None` instead of an implicit `None` default for the `select_pattern` parameter.

Apply this diff:

```diff
-def parse_aggr_config_file(config_file_path: str, select_pattern: str = None):
+def parse_aggr_config_file(config_file_path: str, select_pattern: str | None = None):
```
868-925: Add an explicit Optional type hint and document the port selection logic.

- Static analysis recommends explicit `Optional[str]` or `str | None` for the `select_pattern` parameter.
- The port selection logic (lines 877-879) is unclear: why does having both "CTX" and "GEN" in the index result in port 8336 vs 8333? Add a comment explaining this distinction.

Apply this diff:

```diff
-def parse_multi_node_disagg_config_file(config_file_path: str,
-                                        select_pattern: str = None):
+def parse_multi_node_disagg_config_file(config_file_path: str,
+                                        select_pattern: str | None = None):
     # Get disagg_server_idx from environment variable
     disagg_server_idx = os.environ.get("DISAGG_SERVER_IDX", "DISAGG")
     # Get hostname
     hostname = socket.gethostname()
-    # Determine port based on disagg_server_idx
+    # Determine port based on disagg_server_idx
+    # Worker nodes (CTX/GEN) use port 8336, orchestrator node uses 8333
     port = 8333
     if "CTX" in disagg_server_idx and "GEN" in disagg_server_idx:
         port = 8336
```
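Isolating the port rule in a small pure function would also make the distinction testable on its own. A sketch under the same assumption (an index naming both CTX and GEN roles gets the worker port; the function name is hypothetical):

```python
def select_disagg_port(disagg_server_idx: str) -> int:
    """Sketch of the rule above: an index naming both CTX and GEN roles
    gets the worker port 8336; anything else, including the default
    "DISAGG" orchestrator index, gets 8333."""
    if "CTX" in disagg_server_idx and "GEN" in disagg_server_idx:
        return 8336
    return 8333
```

A function like this keeps the magic numbers in one place and documents itself through its tests.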
2499-2500: Add explicit Optional type hints.

Static analysis recommends explicit `Optional` or a union with `None` for the optional parameters.

Apply this diff:

```diff
 def _get_metric_name(self,
                      metric_type: PerfMetricType,
                      bs: int = None,
                      input_len: int = None,
                      output_len: int = None,
                      server_name: str = None,
-                     client_name: str = None,
-                     disagg_config_name: str = None) -> str:
+                     client_name: str | None = None,
+                     disagg_config_name: str | None = None) -> str:
```
2542-2549: Consider extracting exception messages to constants.

Static analysis suggests avoiding long inline exception messages. While this is a minor style issue, extracting these to constants could improve maintainability.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- jenkins/L0_Test.groovy (3 hunks)
- jenkins/scripts/perf/disaggregated/install.sh (1 hunks)
- jenkins/scripts/perf/disaggregated/slurm_exec.py (1 hunks)
- jenkins/scripts/perf/disaggregated/slurm_launch.sh (1 hunks)
- jenkins/scripts/perf/disaggregated/slurm_run.sh (1 hunks)
- tests/integration/defs/perf/open_search_db_utils.py (6 hunks)
- tests/integration/defs/perf/test_perf.py (30 hunks)
- tests/integration/defs/perf/utils.py (8 hunks)
- tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml (3 hunks)
- tests/integration/test_lists/test-db/perf_sanity_l0_gb200_multi_nodes.yml (1 hunks)
- tests/scripts/perf-sanity/l0_dgx_b200.yaml (1 hunks)
- tests/scripts/perf-sanity/l0_dgx_b300.yaml (1 hunks)
- tests/scripts/perf-sanity/l0_gb200_multi_nodes.yaml (1 hunks)
- tests/scripts/perf-sanity/l0_gb200_multi_nodes_disagg.yaml (1 hunks)
🧰 Additional context used
🧠 Learnings (10)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
- tests/scripts/perf-sanity/l0_dgx_b300.yaml
- tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml
- jenkins/scripts/perf/disaggregated/slurm_run.sh
- tests/scripts/perf-sanity/l0_gb200_multi_nodes.yaml
- tests/integration/test_lists/test-db/perf_sanity_l0_gb200_multi_nodes.yml
- jenkins/L0_Test.groovy
- tests/scripts/perf-sanity/l0_dgx_b200.yaml
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
- jenkins/scripts/perf/disaggregated/install.sh
- jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-11T20:09:24.389Z
Learnt from: achartier
Repo: NVIDIA/TensorRT-LLM PR: 6763
File: tests/integration/defs/triton_server/conftest.py:16-22
Timestamp: 2025-08-11T20:09:24.389Z
Learning: In the TensorRT-LLM test infrastructure, the team prefers simple, direct solutions (like hard-coding directory traversal counts) over more complex but robust approaches when dealing with stable directory structures. They accept the maintenance cost of updating tests if the layout changes.
Applied to files:
- jenkins/scripts/perf/disaggregated/install.sh
- jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
jenkins/scripts/perf/disaggregated/install.sh
📚 Learning: 2025-08-18T09:08:07.687Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 6984
File: cpp/tensorrt_llm/CMakeLists.txt:297-299
Timestamp: 2025-08-18T09:08:07.687Z
Learning: In the TensorRT-LLM project, artifacts are manually copied rather than installed via `cmake --install`, so INSTALL_RPATH properties are not needed - only BUILD_RPATH affects the final artifacts.
Applied to files:
jenkins/scripts/perf/disaggregated/install.sh
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
- jenkins/scripts/perf/disaggregated/install.sh
- tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml
- jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-09-16T09:30:09.716Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7763
File: cpp/tensorrt_llm/CMakeLists.txt:297-301
Timestamp: 2025-09-16T09:30:09.716Z
Learning: In the TensorRT-LLM project, NCCL libraries are loaded earlier by PyTorch libraries or the bindings library, so the main shared library doesn't need NCCL paths in its RPATH - the libraries will already be available in the process address space when needed.
Applied to files:
jenkins/scripts/perf/disaggregated/install.sh
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
- tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml
- jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-29T14:07:45.863Z
Learnt from: EmmaQiaoCh
Repo: NVIDIA/TensorRT-LLM PR: 7370
File: tests/unittest/trt/model_api/test_model_quantization.py:24-27
Timestamp: 2025-08-29T14:07:45.863Z
Learning: In TensorRT-LLM's CI infrastructure, pytest skip markers (pytest.mark.skip) are properly honored even when test files have __main__ blocks that call test functions directly. The testing system correctly skips tests without requiring modifications to the __main__ block execution pattern.
Applied to files:
jenkins/scripts/perf/disaggregated/slurm_run.sh
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/integration/defs/perf/test_perf.py
🧬 Code graph analysis (3)
jenkins/scripts/perf/disaggregated/slurm_run.sh (1)
tests/unittest/llmapi/apps/_test_disagg_serving_multi_nodes.py (1)
env(61-68)
tests/integration/defs/perf/utils.py (2)
tests/integration/defs/perf/misc.py (1)
check_output(59-69)tests/integration/defs/trt_test_alternative.py (4)
popen(199-218)print_error(318-324)print_info(300-306)print_warning(309-315)
tests/integration/defs/perf/test_perf.py (1)
tests/integration/defs/perf/utils.py (7)
AbstractPerfScriptTestClass(579-894)PerfAggrScriptTestCmds(240-288)PerfBenchScriptTestCmds(133-237)PerfDisaggScriptTestCmds(291-350)PerfMetricType(89-119)PerfMultiNodeDisaggScriptTestCmds(353-576)generate_test_nodes(950-984)
🪛 Ruff (0.14.4)
jenkins/scripts/perf/disaggregated/slurm_exec.py
1-1: Shebang is present but file is not executable
(EXE001)
103-103: subprocess call with shell=True identified, security issue
(S602)
127-127: subprocess call: check for execution of untrusted input
(S603)
127-127: Starting a process with a partial executable path
(S607)
128-128: subprocess call: check for execution of untrusted input
(S603)
129-129: Starting a process with a partial executable path
(S607)
139-139: subprocess call with shell=True identified, security issue
(S602)
157-157: subprocess call with shell=True identified, security issue
(S602)
178-178: Do not catch blind exception: Exception
(BLE001)
tests/integration/defs/perf/open_search_db_utils.py
362-363: Avoid equality comparisons to `True`; use `if ...:` for truth checks
Replace comparison
(E712)
tests/integration/defs/perf/utils.py
251-251: Probable use of requests call without timeout
(S113)
254-254: Do not catch blind exception: Exception
(BLE001)
275-275: f-string without any placeholders
Remove extraneous f prefix
(F541)
377-378: Avoid specifying long messages outside the exception class
(TRY003)
416-416: Undefined name num_ctx_servers
(F821)
420-420: Undefined name num_gen_servers
(F821)
437-437: Probable use of requests call without timeout
(S113)
440-440: Do not catch blind exception: Exception
(BLE001)
484-484: Function call with shell=True parameter identified, security issue
(S604)
519-519: Function call with shell=True parameter identified, security issue
(S604)
526-526: f-string without any placeholders
Remove extraneous f prefix
(F541)
565-565: f-string without any placeholders
Remove extraneous f prefix
(F541)
573-573: Unused method argument: cmd_idx
(ARG002)
752-752: f-string without any placeholders
Remove extraneous f prefix
(F541)
tests/integration/defs/perf/test_perf.py
722-722: Unused method argument: working_dir
(ARG002)
808-808: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
869-869: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
2499-2499: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
2500-2500: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
2544-2544: Avoid specifying long messages outside the exception class
(TRY003)
2548-2548: Avoid specifying long messages outside the exception class
(TRY003)
🪛 Shellcheck (0.11.0)
jenkins/scripts/perf/disaggregated/install.sh
[warning] 4-4: resourcePathNode is referenced but not assigned.
(SC2154)
[warning] 7-7: jobWorkspace is referenced but not assigned.
(SC2154)
[warning] 9-9: llmTarfile is referenced but not assigned.
(SC2154)
[warning] 10-10: tarName is referenced but not assigned.
(SC2154)
[warning] 15-15: pytestCommand is referenced but not assigned.
(SC2154)
[warning] 23-23: stageName is referenced but not assigned.
(SC2154)
jenkins/scripts/perf/disaggregated/slurm_launch.sh
[warning] 18-18: configYamlFile appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 19-19: testListPathNode appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 50-50: Declare and assign separately to avoid masking return values.
(SC2155)
[warning] 104-104: Prefer mapfile or read -a to split command output (or quote to avoid splitting).
(SC2207)
[error] 106-106: Argument mixes string and array. Use * or separate argument.
(SC2145)
[warning] 109-109: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[warning] 110-110: Quote to prevent word splitting/globbing, or split robustly with mapfile or read -a.
(SC2206)
[error] 112-112: Argument mixes string and array. Use * or separate argument.
(SC2145)
[error] 113-113: Argument mixes string and array. Use * or separate argument.
(SC2145)
jenkins/scripts/perf/disaggregated/slurm_run.sh
[warning] 5-5: rc is referenced but not assigned.
(SC2154)
[warning] 7-7: resourcePathNode is referenced but not assigned.
(SC2154)
[warning] 42-42: coverageConfigFile is referenced but not assigned.
(SC2154)
[warning] 77-77: perfMode is referenced but not assigned.
(SC2154)
[warning] 78-78: stageName is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (25)
tests/integration/test_lists/test-db/perf_sanity_l0_dgx_b200.yml (2)
3-36: LGTM: Clean test matrix expansion for 8-GPU configuration.

The expansion from a single entry to separate pre_merge and post_merge blocks with explicit conditions is well-structured and follows the existing pattern used for 4-GPU configurations below.

52-54: LGTM: Consistent test variant additions across stages.

The three new test variants (r1_fp4_v2 with different parallelism settings and gpt_oss_fp4 configurations) are consistently added to both pre_merge and post_merge stages, properly expanding FP4-based test coverage.
Also applies to: 71-73
tests/scripts/perf-sanity/l0_dgx_b200.yaml (1)
1-231: LGTM: Well-structured configuration modernization.

The restructuring introduces a consistent nested configuration schema across all seven server configurations. The standardization of client config naming (e.g., `con1024_iter10_1k1k`) and the migration to nested blocks (`attention_dp_config`, `moe_config`, `cuda_graph_config`, `kv_cache_config`, `speculative_config`) significantly improves maintainability and aligns with the broader schema modernization across the repository.

tests/scripts/perf-sanity/l0_gb200_multi_nodes.yaml (2)
1-66: LGTM: Multi-node configuration properly structured.

The two server configurations correctly specify `gpus_per_node: 4` for multi-node distribution. The nested configuration structure is consistent with other perf-sanity YAMLs in this PR.

26-26: Verify the reduced memory fraction for multi-node scenarios.

Both configurations use `free_gpu_memory_fraction: 0.5`, which is lower than the 0.8 typically used in single-node configurations (e.g., l0_dgx_b200.yaml). Confirm this reduction is intentional for multi-node scenarios and accounts for the additional memory overhead of inter-node communication.
tests/integration/test_lists/test-db/perf_sanity_l0_gb200_multi_nodes.yml (1)
1-31: LGTM: Clean multi-node test configuration.

The test list properly configures multi-node GB200 testing with clear topology documentation (2 nodes × 4 GPUs). The structure is consistent with existing test list patterns, and both pre_merge and post_merge stages are appropriately configured.
jenkins/scripts/perf/disaggregated/install.sh (2)
1-23: Good error handling with `set -Eeuo pipefail`.

The script properly uses strict error handling, which will cause it to exit immediately on any command failure, undefined variable, or pipe error.

4-4: Environment variables are provided by the SLURM orchestration context.

Shellcheck warnings about undefined variables (`resourcePathNode`, `jobWorkspace`, `llmTarfile`, `tarName`, `pytestCommand`, `stageName`) can be safely ignored. These variables are exported by the calling SLURM launch script and are part of the documented execution contract for this node initialization script.
tests/scripts/perf-sanity/l0_dgx_b300.yaml (1)
1-66: LGTM: Configuration mirrors the B200 structure for the B300 platform.

The two server configurations follow the same nested schema as l0_dgx_b200.yaml, appropriately adapted for the B300 platform. The consistent structure across platform variants simplifies maintenance and reduces configuration drift.
tests/scripts/perf-sanity/l0_gb200_multi_nodes_disagg.yaml (2)
1-69: LGTM: Well-structured disaggregated configuration.

The disagg_configs structure properly separates hardware topology, timeout, benchmark parameters, and per-server (gen/ctx) configurations. The nested structure is consistent with other perf-sanity YAMLs in this PR.

46-48: Clarify the speculative_config conditional removal guidance.

The comments "when mtp_size is 0, remove this section" appear in both gen and ctx configurations. Verify whether this conditional logic should be handled automatically by the test infrastructure or if it requires manual configuration file editing for different test scenarios.
If this is meant to be handled automatically, consider documenting the logic in the test infrastructure code or adding validation that ensures the speculative_config is consistent with the mtp_size parameter.
Also applies to: 67-69
jenkins/scripts/perf/disaggregated/slurm_exec.py (2)
1-190: LGTM: Well-structured SLURM orchestration with proper error handling.

The script provides comprehensive SLURM job orchestration with proper argument parsing, topology computation from YAML config, job submission, monitoring, and cleanup. The exit code handling and tail process management are correctly implemented.
103-103: `shell=True` usage is acceptable in this CI context.

While static analysis flags `shell=True` as a security concern, the usage here is appropriate for a CI orchestration script where:
- The sbatch command is constructed from validated arguments
- The script runs in a controlled Jenkins/SLURM environment
- The complexity of the multi-line sbatch command would be difficult to express without shell interpretation
The security risk is minimal in this context since all inputs are from trusted Jenkins configuration.
Also applies to: 139-139, 157-157
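If the shell interpretation ever becomes unnecessary, the submission could in principle be expressed as an argv list instead, which sidesteps the S602 finding entirely. A sketch under that assumption (the helper name and the placeholder values are illustrative, not the script's actual command):

```python
def build_sbatch_argv(script_path: str, nodes: int, partition: str) -> list:
    """Build the sbatch invocation as an argv list so no shell parsing
    occurs; the flags are ordinary sbatch options, values are placeholders."""
    return ["sbatch", f"--nodes={nodes}", f"--partition={partition}", script_path]


argv = build_sbatch_argv("slurm_launch.sh", 2, "batch")
# On a Slurm host this would be submitted with subprocess.run(argv, check=True)
```

The trade-off noted in the review stands: multi-line sbatch command strings with redirections are genuinely awkward to express as argv lists, so `shell=True` remains a reasonable choice here.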
tests/integration/defs/perf/test_perf.py (12)
21-21: LGTM!

The `socket` import is appropriately used for hostname resolution in the multi-node disaggregated configuration.

318-351: LGTM!

The rename from `SERVER_BENCHMARK_PERF_METRIC_LOG_QUERIES` to `AGGR_SERVER_PERF_METRIC_LOG_QUERIES` clarifies that these metrics are specific to aggregated server testing, distinguishing them from the new multi-node disaggregated metrics.

588-601: LGTM!

The addition of `hostname` and `port` parameters to `to_cmd()` enables multi-node deployments while maintaining backward compatibility through sensible defaults.

603-681: LGTM!

The expanded `to_db_data()` method comprehensively captures all server configuration fields for database storage, including the new nested configuration sections.

683-698: LGTM!

Good defensive coding: creates a copy of the config data before modifying it and properly handles path conversion for `speculative_model_dir`.

706-719: LGTM!

The refactored `ClientConfig` constructor follows the same pattern as `ServerConfig`, using a dictionary for configuration data with appropriate defaults for optional fields.

1043-1049: LGTM!

The addition of `disagg_configs` to `PerfTestConfig` is well-documented and follows the same pattern as the aggregated-mode configuration fields.

1078-1088: LGTM!

The naming convention clearly distinguishes aggregated server tests (using `server:` and `client:` prefixes) from multi-node disaggregated tests (using the `disagg:` prefix).

1494-1508: LGTM!

The configuration setter methods follow a consistent pattern and properly delegate to the respective parsing functions.
1668-1723: Verify the shared filesystem assumption for multi-node coordination.

The code writes hostname files to a shared `hostnames` directory (lines 1669-1671, 1683-1686) that all nodes must access. This assumes a shared filesystem across nodes. Verify this assumption is valid for the target deployment environment, or add documentation about this requirement.
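The coordination pattern in question can be sketched as follows. This is an illustration of the technique only, assuming a mount visible to every node; the function names and the per-role file layout are assumptions, not the code's exact implementation:

```python
import os
import socket
import tempfile


def publish_hostname(shared_dir: str, role: str) -> str:
    """Write this node's hostname to <shared_dir>/<role> so peers can
    find it. Only works if shared_dir is on a filesystem mounted by
    every node in the job (e.g. an NFS or Lustre workspace)."""
    os.makedirs(shared_dir, exist_ok=True)
    path = os.path.join(shared_dir, role)
    with open(path, "w") as f:
        f.write(socket.gethostname())
    return path


def read_peer_hostname(shared_dir: str, role: str) -> str:
    """Read the hostname a peer with the given role published."""
    with open(os.path.join(shared_dir, role)) as f:
        return f.read().strip()


# Illustration using a temp dir to stand in for the shared mount:
shared = tempfile.mkdtemp()
publish_hostname(shared, "CTX0")
peer = read_peer_hostname(shared, "CTX0")
```

Without a shared mount, the same rendezvous would need an out-of-band channel (e.g. `scontrol show hostnames` of the job's node list), which is why the assumption deserves explicit documentation.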
2389-2409: LGTM!

Good design decision to reuse `AGGR_SERVER_METRICS` for multi-node disaggregated testing, as both modes measure similar performance characteristics.
1264-1270: The review comment is incorrect; no changes are needed.

The code at line 1267 uses `"_multi_nodes_disagg"` (with an 's'), which correctly matches the actual config file naming in tests/scripts/perf-sanity/ (e.g., l0_gb200_multi_nodes_disagg.yaml). The `self.runtime` constant `"multi_node_disagg_server"` is a separate label that doesn't need to match the file naming pattern; it serves a different purpose. The code is correct as-is.

Likely an incorrect or invalid review comment.
d896ea1 to c6c3fae (Compare)
/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

PR_Github #24970 [ run ] triggered by Bot. Commit:

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

PR_Github #25017 [ run ] triggered by Bot. Commit:

PR_Github #25017 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

PR_Github #25444 [ run ] triggered by Bot. Commit:

PR_Github #25444 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1"

PR_Github #25475 [ run ] triggered by Bot. Commit:

PR_Github #25475 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

PR_Github #25509 [ run ] triggered by Bot. Commit:

PR_Github #25509 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

PR_Github #25522 [ run ] triggered by Bot. Commit:

PR_Github #25522 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

PR_Github #25645 [ run ] triggered by Bot. Commit:

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-1"

PR_Github #29966 [ run ] triggered by Bot. Commit:
875e510 to 97e32b7 (Compare)
/bot run

PR_Github #29966 [ run ] completed with state

PR_Github #29970 [ run ] triggered by Bot. Commit:

PR_Github #29970 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Ray-1,GB200-12_GPUs-3_Nodes-PyTorch-Perf-Sanity-Disagg-Post-Merge-1"

PR_Github #29981 [ run ] triggered by Bot. Commit:

PR_Github #29981 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-Perf-Sanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-2"

PR_Github #29983 [ run ] triggered by Bot. Commit:

PR_Github #29984 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #29984 [ run ] completed with state

PR_Github #29988 [ run ] triggered by Bot. Commit:

PR_Github #29988 [ run ] completed with state

/bot run

PR_Github #30003 [ run ] triggered by Bot. Commit:

PR_Github #30003 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #30005 [ run ] triggered by Bot. Commit:

PR_Github #30005 [ run ] completed with state
…#9138) Signed-off-by: Chenfei Zhang <chenfeiz@nvidia.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.